{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 词表征(Word Representation)\n",
    "文本数据经过预处理后，需要转化成数值特征，便于后续处理"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**one-hot 表征** (localist representation)：\n",
    "- 例如给定词典，包含$10^7$个单词，则每个单词表示为 $10^7\\times 1$ 的 one-hot 向量；“我们”表示为 $[0,0,...,0,0,1,0,0,0...0,0,0]_{(10^7,1)}$，1 的索引为“我们”在词典中的位置\n",
    "- 句子或文档向量则可表示为 $10^7\\times 1$ 的向量，词典中每个单词在句子或文档中是否出现；Bolean 向量。\n",
    "- 也可以表示为，词典中每个单词在句子或文档中出现的次数；Count-based 句子向量"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "({'The': 1,\n",
       "  'cat': 2,\n",
       "  'sat': 3,\n",
       "  'on': 4,\n",
       "  'the': 5,\n",
       "  'mat.': 6,\n",
       "  'dog': 7,\n",
       "  'ate': 8,\n",
       "  'my': 9,\n",
       "  'homework.': 10},\n",
       " array([[[0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],\n",
       "         [0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],\n",
       "         [0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.],\n",
       "         [0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],\n",
       "         [0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],\n",
       "         [0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.],\n",
       "         [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],\n",
       "         [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],\n",
       "         [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],\n",
       "         [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]],\n",
       " \n",
       "        [[0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],\n",
       "         [0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.],\n",
       "         [0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0.],\n",
       "         [0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0.],\n",
       "         [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.],\n",
       "         [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],\n",
       "         [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],\n",
       "         [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],\n",
       "         [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],\n",
       "         [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]]]))"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 单词级 one-hot 编码\n",
    "import numpy as np\n",
    "samples = ['The cat sat on the mat.', 'The dog ate my homework.']\n",
    "token_index = {}\n",
    "for sample in samples:\n",
    "    for word in sample.split():\n",
    "        if word not in token_index:\n",
    "            token_index[word] = len(token_index) + 1\n",
    "\n",
    "max_length = 10\n",
    "results = np.zeros(shape=(len(samples), max_length, len(token_index) + 1))\n",
    "for i, sample in enumerate(samples):\n",
    "    for j, word in list(enumerate(sample.split()))[:max_length]:\n",
    "        index = token_index.get(word)\n",
    "        results[i, j, index] = 1.\n",
    "\n",
    "token_index, results"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "({'0': 1,\n",
       "  '1': 2,\n",
       "  '2': 3,\n",
       "  '3': 4,\n",
       "  '4': 5,\n",
       "  '5': 6,\n",
       "  '6': 7,\n",
       "  '7': 8,\n",
       "  '8': 9,\n",
       "  '9': 10,\n",
       "  'a': 11,\n",
       "  'b': 12,\n",
       "  'c': 13,\n",
       "  'd': 14,\n",
       "  'e': 15,\n",
       "  'f': 16,\n",
       "  'g': 17,\n",
       "  'h': 18,\n",
       "  'i': 19,\n",
       "  'j': 20,\n",
       "  'k': 21,\n",
       "  'l': 22,\n",
       "  'm': 23,\n",
       "  'n': 24,\n",
       "  'o': 25,\n",
       "  'p': 26,\n",
       "  'q': 27,\n",
       "  'r': 28,\n",
       "  's': 29,\n",
       "  't': 30,\n",
       "  'u': 31,\n",
       "  'v': 32,\n",
       "  'w': 33,\n",
       "  'x': 34,\n",
       "  'y': 35,\n",
       "  'z': 36,\n",
       "  'A': 37,\n",
       "  'B': 38,\n",
       "  'C': 39,\n",
       "  'D': 40,\n",
       "  'E': 41,\n",
       "  'F': 42,\n",
       "  'G': 43,\n",
       "  'H': 44,\n",
       "  'I': 45,\n",
       "  'J': 46,\n",
       "  'K': 47,\n",
       "  'L': 48,\n",
       "  'M': 49,\n",
       "  'N': 50,\n",
       "  'O': 51,\n",
       "  'P': 52,\n",
       "  'Q': 53,\n",
       "  'R': 54,\n",
       "  'S': 55,\n",
       "  'T': 56,\n",
       "  'U': 57,\n",
       "  'V': 58,\n",
       "  'W': 59,\n",
       "  'X': 60,\n",
       "  'Y': 61,\n",
       "  'Z': 62,\n",
       "  '!': 63,\n",
       "  '\"': 64,\n",
       "  '#': 65,\n",
       "  '$': 66,\n",
       "  '%': 67,\n",
       "  '&': 68,\n",
       "  \"'\": 69,\n",
       "  '(': 70,\n",
       "  ')': 71,\n",
       "  '*': 72,\n",
       "  '+': 73,\n",
       "  ',': 74,\n",
       "  '-': 75,\n",
       "  '.': 76,\n",
       "  '/': 77,\n",
       "  ':': 78,\n",
       "  ';': 79,\n",
       "  '<': 80,\n",
       "  '=': 81,\n",
       "  '>': 82,\n",
       "  '?': 83,\n",
       "  '@': 84,\n",
       "  '[': 85,\n",
       "  '\\\\': 86,\n",
       "  ']': 87,\n",
       "  '^': 88,\n",
       "  '_': 89,\n",
       "  '`': 90,\n",
       "  '{': 91,\n",
       "  '|': 92,\n",
       "  '}': 93,\n",
       "  '~': 94,\n",
       "  ' ': 95,\n",
       "  '\\t': 96,\n",
       "  '\\n': 97,\n",
       "  '\\r': 98,\n",
       "  '\\x0b': 99,\n",
       "  '\\x0c': 100},\n",
       " array([[[0., 0., 0., ..., 0., 0., 0.],\n",
       "         [0., 0., 0., ..., 0., 0., 0.],\n",
       "         [0., 0., 0., ..., 0., 0., 0.],\n",
       "         ...,\n",
       "         [0., 0., 0., ..., 0., 0., 0.],\n",
       "         [0., 0., 0., ..., 0., 0., 0.],\n",
       "         [0., 0., 0., ..., 0., 0., 0.]],\n",
       " \n",
       "        [[0., 0., 0., ..., 0., 0., 0.],\n",
       "         [0., 0., 0., ..., 0., 0., 0.],\n",
       "         [0., 0., 0., ..., 0., 0., 0.],\n",
       "         ...,\n",
       "         [0., 0., 0., ..., 0., 0., 0.],\n",
       "         [0., 0., 0., ..., 0., 0., 0.],\n",
       "         [0., 0., 0., ..., 0., 0., 0.]]]))"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 字符级 one-hot 编码\n",
    "import string\n",
    "samples = ['The cat sat on the mat.', 'The dog ate my homework.']\n",
    "\n",
    "characters = string.printable\n",
    "token_index = dict(zip(characters, range(1, len(characters) + 1)))\n",
    "max_length = 50\n",
    "\n",
    "results = np.zeros((len(samples), max_length, max(token_index.values()) + 1))\n",
    "for i, sample in enumerate(samples):\n",
    "    for j, character in enumerate(sample):\n",
    "        index = token_index.get(character)\n",
    "        results[i, j, index] = 1\n",
    "        \n",
    "token_index, results        "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "({'the': 1,\n",
       "  'cat': 2,\n",
       "  'sat': 3,\n",
       "  'on': 4,\n",
       "  'mat': 5,\n",
       "  'dog': 6,\n",
       "  'ate': 7,\n",
       "  'my': 8,\n",
       "  'homework': 9},\n",
       " array([[0., 1., 1., 1., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,\n",
       "         0., 0., 0., 0.],\n",
       "        [0., 1., 0., 0., 0., 0., 1., 1., 1., 1., 0., 0., 0., 0., 0., 0.,\n",
       "         0., 0., 0., 0.]]))"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# tf 实现 one-hot 编码，句子的 Bolean 向量\n",
    "from tensorflow.keras.preprocessing.text import Tokenizer\n",
    "samples = ['The cat sat on the mat.', 'The dog ate my homework.']\n",
    "\n",
    "tokenizer = Tokenizer(num_words=20) # 只考虑前20个最常见的单词\n",
    "tokenizer.fit_on_texts(samples)\n",
    "\n",
    "sequences = tokenizer.texts_to_sequences(samples)  # 整数索引列表\n",
    "one_hot_results = tokenizer.texts_to_matrix(samples,\n",
    "                                            mode='binary')  # one-hot 向量\n",
    "word_index = tokenizer.word_index  # 单词-索引词典\n",
    "\n",
    "word_index, one_hot_results"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "使用**散列技巧**的单词级 one-hot 编码\n",
    "   - 当词汇表非常大大时，散列技巧降低向量长度\n",
    "   - 可能出现散列冲突"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "samples = ['The cat sat on the mat.', 'The dog ate my homework.']\n",
    "\n",
    "# 如词汇表有 10000 个单词时，将向量长度从 10000 降低到 1000\n",
    "dimensionality = 1000\n",
    "max_length = 10\n",
    "\n",
    "results = np.zeros(shape=(len(samples), max_length, dimensionality))\n",
    "for i, sample in enumerate(samples):\n",
    "    for j, word in list(enumerate(sample.split()))[:max_length]:\n",
    "        index = abs(hash(word)) % dimensionality\n",
    "        results[i, j, index] = 1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "$tf-idf$ 表征\n",
    "\n",
    "$$tfidf(w)=tf(d,w)\\times idf(w)$$\n",
    "\n",
    "- $tf(d,w)$ - 文档 d 中单词 w 的词频；\n",
    "- $idf(w)=log\\frac{N}{N_{w}}$，$N$ - 语料库的文档总数，$N_{w}$ - 词语 w 出现在多少个文档中"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "matrix([[0.        , 0.46979139, 0.58028582, 0.38408524, 0.        ,\n",
       "         0.        , 0.38408524, 0.        , 0.38408524],\n",
       "        [0.        , 0.6876236 , 0.        , 0.28108867, 0.        ,\n",
       "         0.53864762, 0.28108867, 0.        , 0.28108867],\n",
       "        [0.51184851, 0.        , 0.        , 0.26710379, 0.51184851,\n",
       "         0.        , 0.26710379, 0.51184851, 0.26710379],\n",
       "        [0.        , 0.46979139, 0.58028582, 0.38408524, 0.        ,\n",
       "         0.        , 0.38408524, 0.        , 0.38408524]])"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.feature_extraction.text import TfidfVectorizer\n",
    "corpus = [\n",
    "    'This is the first document.',\n",
    "    'This document is the second document.',\n",
    "    'And this is the third one.',\n",
    "    'Is this the first document?',\n",
    "]\n",
    "vectorizer = TfidfVectorizer()\n",
    "X = vectorizer.fit_transform(corpus)\n",
    "X.todense()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']\n"
     ]
    }
   ],
   "source": [
    "print(vectorizer.get_feature_names())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`localist representation` 的缺点：\n",
    "1. 稀疏性表征，向量长度为词典内单词总数\n",
    "2. 无法表征单词的相似度，任意两个单词向量的点积为 0\n",
    "3. 表达能力弱，如无法表达语义；同样 8 维向量，one-hot只能表示 8 个单词，而词向量每个元素取值连续，因此能表示无穷个单词"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 词向量"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- 分布语义(distributional Semantic)，特定单词的含义由频繁出现在其前后的词决定 $\\Longrightarrow$ 分布式表征，通过单词的上下文(context)表征单词。\n",
    "    \n",
    "       \n",
    "- 不同的单词，其上下文越相近，该单词的含义越相似。例如单词 w，其上下文由其前 n 个单词及其后 n 个单词的集合组成，n 表示窗口尺寸(fixed window)。如下所示，单词 \"banking\" 的上下文，窗口尺寸 $n=5$ :\n",
    "$$\n",
    "\\begin{align*}\n",
    "...goverment\\ debt\\ problems\\ turning\\ into\\ &\\boxed{banking}\\ crisises\\ as\\ happened\\ in\\ 2009 ...\\\\\n",
    "...saying\\ that\\ Europe\\ needs\\ unified\\ &\\boxed{banking}\\ regulation\\ to\\ replace\\ the\\ hodgepodge...\\\\\n",
    "...Indian\\ has\\ just\\ given\\ its\\ &\\boxed{banking}\\ system\\ a\\ shot\\ in\\ the\\ arm...\n",
    "\\end{align*}\n",
    "$$\n",
    "     \n",
    "       \n",
    "- 词向量，也称为 word embedding 或 word representation，将文本映射到特定维度的向量空间（vector space）中：\n",
    "    1. 更致密的向量，\n",
    "    2. 点积表示不同词间的相似性，表达语义\n",
    "    3. capacity，表达能力强，词向量每个元素取值连续，因此能表示无穷个单词\n",
    "    4. 泛化能力强，即在测试集上也有很好的效果，可用于迁移学习进行后续操作"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 训练词向量原理：\n",
    "- 大量语料库，从中得出固定的词汇表，每个单词通过一个向量表示，向量维度 50、200或300\n",
    "- 按语料库中文本的顺序，遍历每一个位置，以该位置单词作为中心词 c，选定窗口尺寸 n，窗口内上下文单词为 o\n",
    "- 利用词向量的相似性，计算给定 c 时，o 的概率，或相反；c 的词向量就可以用来计算 o 中的单词\n",
    "- 不断调整词向量，最大化概率"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## `Skip-Gram Model`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- 将文本按顺序，将固定窗口尺寸 n=2 内的单词，转变成 中心词-上下文词 组成的词对；中心词将作为输入，上下文词将作为训练标签。词汇表，10000个单词\n",
    "\n",
    "  <img src=\"../images/词向量训练样本.png\" alt=\"skip-gram\" width=\"80%;\" />\n",
    "\n",
    "- 构建如下图神经网络，输入为中心词 ‘ants’ 的 one-hot 向量；隐藏层矩阵本质为词向量查找表 ($10000\\times 300$)，通过 one-hot 向量查找到 ’ants‘ 的词向量 ($300,$)；再经过输出层矩阵 ($300\\times 10000$)的处理，得到 ($10000,$) 的向量，为词汇表中每个单词的概率分布。\n",
    "\n",
    "  <img src=\"../images/词向量神经网络.png\" alt=\"词向量神经网络\" width=\"80%;\" />\n",
    "\n",
    "- 例如，输入词对，（’ants‘，’car‘），训练目标：使得 ‘ants' 的上下文词  ’car‘  的概率最大，而其它非上下文词的概率最小。最终得到隐藏层的权重矩阵即为所要得到的词向量，输出层矩阵则丢弃。\n",
    "\n",
    "  <img src=\"../images/词向量输出层png.png\" alt=\"词向量输出层\" width=\"80%;\" />\n",
    "\n",
    "- 输出层有 ($300\\times 10000$) 个参数，且 softmax 函数还需要计算 $\\sum_i^{10000}{e^i}$ ，模型最终的计算量会非常大"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## CBOW模型\n",
    "- 与 skip-gram 相反，以 上下文词 作为输入，来训练模型，以最大化中心词的概率   \n",
    "<img src=\"../images/CBOW.PNG\" width=\"60%;\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-04-08T07:54:14.126424Z",
     "start_time": "2020-04-08T07:54:14.117148Z"
    }
   },
   "source": [
    "## 重采样\n",
    "在语料中频繁出现的一些单词，对中心词并没有给出多大信息，如 (`fox`, `the`) 词对中的 `the`。Word2Vec 实现了 重采样机制，以解决该问题\n",
    "\n",
    "- 给定单词 $w_i$ ，其在语料库($10^6$个单词)中的词频为 $z(w_i)$\n",
    "\n",
    "- `sample` 参数决定出现多少重采样，值越小，单词越不可能被选择，sample=0.001\n",
    "\n",
    "- 重采样单词 $w_i$ 的概率为：\n",
    "\n",
    "  $$P(w_i)=(\\sqrt{\\frac{z(w_i)}{0.001}}+1)\\cdot \\frac{0.001}{z(w_i)}$$\n",
    "\n",
    "<img src=\"../images/重采样概率分布.png\" alt=\"重采样概率分布\" width=\"65%\">\n",
    "\n",
    "   - $P(w_i)=1.0$，当单词词频 $z(w_i)<=0.0026$ 时，该单词 100% 会被选择\n",
    "   - $P(w_i)=0.5$，当单词词频 $z(w_i)<=0.00746$ 时，该单词有 50% 可能性会被选择\n",
    "   - $P(w_i)=0.033$，当单词词频 $z(w_i)<=1.0$ 时，该单词 3.3% 会被选择\n",
    "        - 即整个语料库由该单词组成\n",
    "        \n",
    "        \n",
    "重采样后，频繁出现的单词在采样时被选择的概率降低，而稀有词在采样时不会被忽略        \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 加速训练\n",
    "\n",
    "由于输出层的$softmax$函数会输出整个词汇表中所有单词的的概率，计算量非常大，两种方法加快训练速度：\n",
    "   - `Hierarchical Softmax`\n",
    "   - `Negative sampling`\n",
    "   \n",
    "   \n",
    "   \n",
    "## `Hierarchical Softmax`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-04-08T07:24:38.747944Z",
     "start_time": "2020-04-08T07:24:38.735945Z"
    }
   },
   "source": [
    "\n",
    "- 基于层级$softmax$的模型，$softmax$的输出不再是词汇表所有单词的概率分布，而是一颗`Huffman`树，其中叶子节点共`N`个，对应于`N`个单词，非叶子节点`N-1`个\n",
    "\n",
    "#### 霍夫曼树：     \n",
    "- 霍夫曼树(`Huffman Tree`)，一种满二叉树，其带权路径长最小，也称为最优二叉树\n",
    "    - 带权路径长，指的是所有叶子节点的权值与路径长的乘积之和；\n",
    "    - 若叶子节点的权值都是已知的，使权值越大的叶子节点路径越小，则整棵树的带权路径长最小。\n",
    "- 构造霍夫曼树的目的是为了完成霍夫曼编码，变长、极少多余编码方案，使得频率高的字符对应的二进制位较短，频率低的字符对应的二进制位较长。相对于等长编码，将文件中每个字符转换为固定个数的二进制位\n",
    "\n",
    "                                     \n",
    "> 霍夫曼树的构造，给定包含权重值为 $(w_1,w_2,...w_n)$  的节点列表：\n",
    "   - 将节点列表按权重进行升序排列，并构建一颗空树\n",
    "   - 遍历排序后的节点，依次将其添加到树中，添加规则如下：\n",
    "      - 若树为空，则作为根节点\n",
    "      - 若权重大于根节点权重，则该节点作为根节点右兄弟，生成一个新的根节点，权重为两者之和\n",
    "      - 若权重不大于根节点权重，则该节点作为根节点左兄弟，生成一个新的根节点，权重为两者之和\n",
    "   - 约定霍夫曼树左子树编码为0，右子树编码为1，左子树的权重小于右子树的权重；\n",
    "<img src=\"../images/霍夫曼树.png\" alt=\"霍夫曼树\" width=\"65%\">\n",
    "   - 权重越高的节点编码值越短，权重越低的编码值越长；保证了越常用的单词拥有越短的编码。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 词汇表构建的霍夫曼树\n",
    "- 首先根据语料库，构建词典$V$，然后以每个单词的词频为权重构建霍夫曼树，\n",
    "  - 出现越频繁的单词在树中深度越浅，越少见的词在越深；每个单词有其对应的由‘01’组成的编码\n",
    "  - 如下图所示：<img src=\"../images/vocab_huffman.png\" width=\"60%\">\n",
    "  - 树中叶子节点即为每个单词，内部节点为模型参数\n",
    "  \n",
    "  \n",
    "- 在训练词对 ”深度：学习“ 时，目标为 “学习”，其路径为 `643`，编码为 `101`；\n",
    "    - 因此输入 ”深度“ 的词向量，经节点 6 处理，需要输出值 1；经节点 4 处理需要输出值 0；经过节点 3 处理，需要输出值 1\n",
    "<img src=\"../images/navi_huffman.png\" width=\"60%\">\n",
    "    - 输入词向量与 目标单词 的霍夫曼树路径中每个节点参数求 **点积，然后`sigmoid`激活** 处理得到的结果，要与每个编码值相同\n",
    "    - 训练完成后，每个单词对应的词向量即为目标数据，霍夫曼树的节点参数被丢弃\n",
    "    \n",
    "    \n",
    "- 训练完成后，再次输入单词 ”深度“ 的词向量，经过节点 6 处理得到概率 0.55；\n",
    "    - 因为训练时，输入为 “深度”，目标单词会有很多，概率 0.55，说明 “深度”的目标单词中 55% 出现在节点 6 的右边，45% 出现在左边；\n",
    "    - “深度：的” 词对占所有 “深度”词对的 0.45*0.75\n",
    "<img src=\"../images/huffman_prob.png\" width=\"60%\">\n",
    "    "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### `Hierarchical Softmax`:\n",
    "\n",
    "\n",
    "\n",
    "- CBOW 模型中，给定上下文词  o，最大化中心词 c 的条件概率 $P(c|o)$\n",
    "\n",
    "  - $X_o$：上下文词向量平均后的向量，作为树的输入\n",
    "  - $p$ ：表示从根节点出发到中心词 c 所代表的叶节点的路径\n",
    "  - $l$：表示路径包含的节点的个数\n",
    "  - $p_1, p_2, ...p_l$：表示路径 $p$ 中的节点\n",
    "  - $d_2 ,d_3, ...d_l \\in{\\{0,1\\}}$：表示路径 $p$ 中每个节点的编码，0 或 1，根节点无编码\n",
    "  - $\\theta_1, \\theta_2,...\\theta_{l-1}$：表示路径 $p$ 中非叶节点对应的参数向量\n",
    "\n",
    "  中心词的条件概率，可以看作从根节点到中心词对应的叶子节点的路径的概率，该路径是唯一的；在树中选择左分支还是右分支，即可看作一个二分类问题，选择每个节点的概率：\n",
    "$$\n",
    "\\begin{align}P(d_j|X_o,\\theta_{j-1})&=\\begin{cases}\\sigma(X_o^T\\theta_{j-1}),  &d_j=0 \\\\1-\\sigma(X_o^T\\theta_{j-1}),  &d_j=1\\end{cases}\\\\&=\\left[\\sigma(X_o^T\\theta_{j-1})\\right]^{1-d_j}\\cdot \\left[1-\\sigma(X_o^T\\theta_{j-1})\\right]^{d_j}\\end{align}\n",
    "$$\n",
    "  中性词的概率，即为到中心词节点的路径中所有节点的概率的连乘：\n",
    "  $$\n",
    "  P(c|o)=\\prod_{j=2}^{l}P(d_j|X_o,\\theta_{j-1})\n",
    "  $$\n",
    "\n",
    "  最大化模型的目标函数：\n",
    "  $$\n",
    "  \\begin{align}L&=\\sum_{c\\in V}{logP(c|o)}\\\\&=\\sum_{c\\in V}\\sum_{j=2}^{l}\\Big\\{(1-d_j)\\cdot log\\left[\\sigma(X_o^T\\theta_{j-1})\\right]+d_j\\cdot log\\left[1-\\sigma(X_o^T\\theta_{j-1})\\right]\\Big\\}\\\\&=\\sum_{c\\in V}\\sum_{j=2}^{l}L(o,j)\\end{align}\n",
    "  $$\n",
    "  最大化目标函数中的每一项，两个参数，节点参数 $\\theta_{j-1}$，输入的向量 $X_o$，设定学习速率 $\\eta$。梯度下降法更新这两个参数：\n",
    "  $$\n",
    "  \\begin{align}\\theta_{j-1}&:\\theta_{j-1}+\\eta\\cdot\\frac{\\partial L(o,j)}{\\partial\\theta_{j-1}}\\\\X_o&:X_o+\\eta\\cdot\\sum_{j=2}^{l}\\frac{\\partial L(o,j)}{\\partial X_o}\\end{align}\n",
    "  $$\n",
    "  $X_o$ 为多个上下文词向量的平均，word2vec 直接将更新量应用到对应的那几个上下文词向量上"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## `Negative Sampling`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "输出时，不再是计算整个词汇表的概率分布；而是从词汇表中选择一定量的负样本，再加上目标单词，计算这些单词的概率分布，训练时也只会更新这些单词对应的权重参数\n",
    "- 如`(fox,quick)`词对，输出层矩阵$300\\times 10000$处理输入中心词`fox`词向量时，只从输出层矩阵中选择上下文词`quick`对应的那列权重向量`(positive word)`，以及随机选择的另外 5 个单词对应的权重向量，来进行更新，此时只需要更新 ($300\\times 6$) 个参数\n",
    "\n",
    "如何选择负样本(Negative Samples)：\n",
    "- 可以通过词频均匀采样：$P(w_i)=\\frac{f(w_i)}{\\sum_{j=0}^{n}(f(w_j))}$\n",
    "- 效果最好的是 3/4 指数采样：$P(w_i)=\\frac{f(w_i)^{3/4}}{\\sum_{j=0}^{n}(f(w_j)^{3/4})}$\n",
    "    - $f(w_i)$ 为单词 $w_i$ 在语料库中出现的概率\n",
    "\n",
    "底层 C 语言实现了包含 100M 元素的数组，称为 unigram table。向表中填充单词对应的索引，该索引出现的次数由 $P(w_i)\\cdot 表的尺寸$ 决定。选择负样本时，只要生成 $0-100M$ 内的 5 个随机数"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**参考文章**：\n",
    "     \n",
    "1.http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/\n",
    "2.http://mccormickml.com/2017/01/11/word2vec-tutorial-part-2-negative-sampling/\n",
    "3.https://ruder.io/word-embeddings-softmax/\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.4"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {
    "height": "calc(100% - 180px)",
    "left": "10px",
    "top": "150px",
    "width": "346.179px"
   },
   "toc_section_display": true,
   "toc_window_display": false
  },
  "varInspector": {
   "cols": {
    "lenName": 16,
    "lenType": 16,
    "lenVar": 40
   },
   "kernels_config": {
    "python": {
     "delete_cmd_postfix": "",
     "delete_cmd_prefix": "del ",
     "library": "var_list.py",
     "varRefreshCmd": "print(var_dic_list())"
    },
    "r": {
     "delete_cmd_postfix": ") ",
     "delete_cmd_prefix": "rm(",
     "library": "var_list.r",
     "varRefreshCmd": "cat(var_dic_list()) "
    }
   },
   "types_to_exclude": [
    "module",
    "function",
    "builtin_function_or_method",
    "instance",
    "_Feature"
   ],
   "window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
